Auto-detect audio format in OpenAISpeechToTextClient #7575

Open
jozkee wants to merge 1 commit into main from issue-7543
@jozkee jozkee commented Jun 16, 2026

Member

When the audio stream is not a FileStream, the client now peeks at the leading bytes to detect the format (wav, webm, m4a, mp3) and sets the multipart filename accordingly. This fixes HTTP 400 errors when sending non-MP3 audio (e.g. WAV) in a MemoryStream, since the OpenAI API uses the file extension to determine the audio format.

  • Add DetectAudioExtension using Span.SequenceEqual for readability
  • Add integration tests for all OpenAI-supported formats (mp3, wav, m4a, webm)
  • Add unit tests covering each magic-byte detection branch
  • Add ExpectedAudioFilename assertion to VerbatimMultiPartHttpHandler

Fixes #7543

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jozkee jozkee requested a review from rogerbarreto June 16, 2026 21:46
@jozkee jozkee self-assigned this Jun 16, 2026
@jozkee jozkee requested a review from a team as a code owner June 16, 2026 21:46
Copilot AI review requested due to automatic review settings June 16, 2026 21:46
@jozkee jozkee added the area-ai Microsoft.Extensions.AI libraries label Jun 16, 2026

Copilot AI left a comment

Contributor

Pull request overview

This PR updates OpenAISpeechToTextClient to auto-detect audio format (wav/webm/m4a/mp3) from leading “magic bytes” when the provided audio stream is not a FileStream, and uses the detected extension in the multipart filename so OpenAI can correctly infer the format (fixing 400s for non-MP3 MemoryStream inputs).

Changes:

  • Add stream-header “magic byte” detection and filename resolution logic in OpenAISpeechToTextClient.
  • Add unit tests validating filename selection for each supported format and branch.
  • Add integration coverage for multiple embedded audio formats and enhance multipart handler assertions to validate the uploaded filename.

Reviewed changes

Copilot reviewed 5 out of 9 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/Libraries/Microsoft.Extensions.AI.OpenAI/OpenAISpeechToTextClient.cs | Adds filename resolution with magic-byte detection for non-FileStream inputs. |
| test/Libraries/Microsoft.Extensions.AI.OpenAI.Tests/OpenAISpeechToTextClientTests.cs | Adds theory-based unit tests asserting detected multipart filenames for different headers. |
| test/Libraries/Microsoft.Extensions.AI.Integration.Tests/VerbatimMultiPartHttpHandler.cs | Adds optional filename assertion for multipart “file” fields. |
| test/Libraries/Microsoft.Extensions.AI.Integration.Tests/SpeechToTextClientIntegrationTests.cs | Adds integration test that exercises auto-detection across multiple audio formats. |
| test/Libraries/Microsoft.Extensions.AI.Integration.Tests/Microsoft.Extensions.AI.Integration.Tests.csproj | Embeds additional audio resource files used by the new integration test. |

Comments suppressed due to low confidence (1)

src/Libraries/Microsoft.Extensions.AI.OpenAI/OpenAISpeechToTextClient.cs:121

  • In GetStreamingTextAsync, ResolveFilename(audioSpeechStream) is executed unconditionally even for translation requests, but the translation branch immediately delegates to GetTextAsync(...) (which resolves the filename again). With the new magic-byte peek, this results in redundant header reads/rewinds for translation streaming.
        _ = Throw.IfNull(audioSpeechStream);

        string filename = ResolveFilename(audioSpeechStream);

        if (IsTranslationRequest(options))
        {
            foreach (var update in (await GetTextAsync(audioSpeechStream, options, cancellationToken).ConfigureAwait(false)).ToSpeechToTextResponseUpdates())

}

/// <summary>Detects the audio format extension from the leading bytes of the audio data.</summary>
private static string DetectAudioExtension(ReadOnlySpan<byte> header)

@jozkee jozkee Jun 16, 2026

Member Author

For reference, the OpenAI-supported formats are mp3, mp4, mpeg, mpga, m4a, wav, and webm. Below are quotes from the specs that back each match performed in this method:

  1. WAV — RIFF at offset 0, WAVE at offset 8
    Source: Microsoft Multimedia Programming Interface and Data Specifications 1.0 (August 1991), referenced from:
    https://www.mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html

| Field | Length | Contents |
| --- | --- | --- |
| ckID | 4 | Chunk ID: "RIFF" |
| cksize | 4 | Chunk size: 4+n |
| WAVEID | 4 | WAVE ID: "WAVE" |

And later under Examples, the full structure shows bytes 0–3 = RIFF, bytes 4–7 = size, and the WAVEID field at bytes 8–11 is WAVE.

  2. MP3 / MPEG / MPGA — ID3 at offset 0, or frame sync 0xFF 0xE_
    Source: http://www.mp3-tech.org/programmer/frame_header.html (authoritative MP3 technical reference, derived from ISO/IEC 11172-3)

Verified citation (exact text):

The first twelve bits (or first eleven bits in the case of the MPEG 2.5 extension) of a frame header are always set to 1 and are called "frame sync".

And the header table shows:

| Sign | Length (bits) | Position (bits) | Description |
| --- | --- | --- | --- |
| A | 11 | (31-21) | Frame sync (all bits must be set) |

11 bits set means byte 0 is 0xFF and the top 3 bits of byte 1 are set, i.e. header[0] == 0xFF && (header[1] & 0xE0) == 0xE0.

For ID3v2 tags preceding MP3 data:
Source: https://id3.org/id3v2.3.0 — Section 3.1 "ID3v2 header"

"The first three bytes of the tag are always "ID3" to indicate that this is an ID3v2 tag"

  3. MP4 / M4A — ftyp at offset 4
    Source: W3C Note "ISO BMFF Byte Stream Format" (referencing ISO/IEC 14496-12 "ISO Base Media File Format"):
    https://www.w3.org/TR/mse-byte-stream-format-isobmff/

Verified citation (exact text):

An ISO BMFF initialization segment is defined in this specification as a single File Type Box (ftyp) followed by a single Movie Box (moov).

Per ISO 14496-12 box format: bytes 0–3 = box size (uint32 big-endian), bytes 4–7 = box type (FourCC). The first box MUST be ftyp.

  4. WebM — 0x1A 0x45 0xDF 0xA3 at offset 0
    Source: RFC 8794 — "Extensible Binary Meta Language" (IETF Standards Track), Section 8.1 "EBML Header":
    https://www.rfc-editor.org/rfc/rfc8794.txt

Verified citation (exact text from Section 8.1):

The EBML Header MUST contain a single Master Element with an Element Name of "EBML" and Element ID of "0x1A45DFA3" (see Section 11.2.1)

WebM is a profile of Matroska (RFC 9559), which is an EBML Document Type. Every WebM file begins with the EBML Header whose first element has ID 0x1A45DFA3.
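
The four signatures quoted above can be sketched as a single span-based check. This is an illustrative sketch only, not the PR's implementation: the class name `AudioMagicBytes`, the check order, the choice to report every `ftyp` file as `m4a`, and the `null` return for unrecognized headers are all assumptions made here.

```csharp
#nullable enable
using System;

public static class AudioMagicBytes
{
    // Signatures taken from the spec quotes above.
    private static ReadOnlySpan<byte> Riff => new byte[] { 0x52, 0x49, 0x46, 0x46 }; // "RIFF"
    private static ReadOnlySpan<byte> Wave => new byte[] { 0x57, 0x41, 0x56, 0x45 }; // "WAVE"
    private static ReadOnlySpan<byte> Ebml => new byte[] { 0x1A, 0x45, 0xDF, 0xA3 }; // EBML header ID
    private static ReadOnlySpan<byte> Ftyp => new byte[] { 0x66, 0x74, 0x79, 0x70 }; // "ftyp"
    private static ReadOnlySpan<byte> Id3 => new byte[] { 0x49, 0x44, 0x33 };        // "ID3"

    // Hypothetical helper: returns an extension, or null when nothing matches
    // (the null fallback is an assumption, not the PR's behavior).
    public static string? DetectExtension(ReadOnlySpan<byte> header)
    {
        // WAV: "RIFF" at offset 0 and "WAVE" at offset 8.
        if (header.Length >= 12 &&
            header.Slice(0, 4).SequenceEqual(Riff) &&
            header.Slice(8, 4).SequenceEqual(Wave))
        {
            return "wav";
        }

        // WebM: EBML element ID 0x1A45DFA3 at offset 0.
        if (header.Length >= 4 && header.Slice(0, 4).SequenceEqual(Ebml))
        {
            return "webm";
        }

        // MP4 / M4A: box type "ftyp" at offset 4, after the 4-byte box size.
        if (header.Length >= 8 && header.Slice(4, 4).SequenceEqual(Ftyp))
        {
            return "m4a";
        }

        // MP3 with a leading ID3v2 tag.
        if (header.Length >= 3 && header.Slice(0, 3).SequenceEqual(Id3))
        {
            return "mp3";
        }

        // Bare MPEG audio frame: 11 sync bits set (0xFF, then top 3 bits of the next byte).
        if (header.Length >= 2 && header[0] == 0xFF && (header[1] & 0xE0) == 0xE0)
        {
            return "mp3";
        }

        return null;
    }
}
```

The frame-sync test is placed last because it is the weakest signal: it inspects only 11 bits, while the other checks compare three or four full bytes.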

@dotnet-comment-bot

Collaborator

‼️ Found issues ‼️

| Project | Coverage Type | Expected | Actual |
| --- | --- | --- | --- |
| Microsoft.Extensions.Diagnostics.Testing | Line | 99 | 98.65 🔻 |
| Microsoft.Extensions.Telemetry | Line | 93 | 91.95 🔻 |
| Microsoft.Extensions.AI | Line | 89 | 88.53 🔻 |
| Microsoft.Extensions.AI | Branch | 89 | 88.57 🔻 |
| Microsoft.Extensions.AI.OpenAI | Line | 75 | 62.86 🔻 |
| Microsoft.Extensions.AI.OpenAI | Branch | 75 | 50.31 🔻 |
| Microsoft.Extensions.DataIngestion.MarkItDown | Line | 75 | 4.46 🔻 |
| Microsoft.Extensions.DataIngestion.MarkItDown | Branch | 75 | 0 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring | Line | 99 | 96.03 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring | Branch | 99 | 94.39 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring.Kubernetes | Line | 99 | 97.73 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Dns | Line | 75 | 69.93 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Abstractions | Line | 75 | 42.11 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Abstractions | Branch | 75 | 42.86 🔻 |
| Microsoft.Extensions.ServiceDiscovery | Line | 75 | 67.81 🔻 |
| Microsoft.Extensions.ServiceDiscovery | Branch | 75 | 71.43 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Yarp | Line | 75 | 73.85 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Yarp | Branch | 75 | 70 🔻 |
| Microsoft.Extensions.VectorData.Abstractions | Line | 75 | 37.39 🔻 |
| Microsoft.Extensions.VectorData.Abstractions | Branch | 75 | 22.73 🔻 |

🎉 Good job! The coverage increased 🎉
Update MinCodeCoverage in the project files.

| Project | Expected | Actual |
| --- | --- | --- |
| Microsoft.Gen.BuildMetadata | 97 | 100 |
| Microsoft.Gen.MetadataExtractor | 57 | 73 |
| Microsoft.Gen.MetricsReports | 67 | 69 |
| Microsoft.Extensions.AI.Abstractions | 82 | 85 |
| Microsoft.Extensions.AI.Evaluation.NLP | 0 | 78 |
| Microsoft.Extensions.Caching.Hybrid | 82 | 89 |
| Microsoft.Extensions.DataIngestion | 75 | 89 |
| Microsoft.Extensions.DataIngestion.Markdig | 75 | 90 |
| Microsoft.Extensions.Http.Resilience | 97 | 100 |

Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1467244&view=codecoverage-tab

int bytesRead = 0;
while (bytesRead < header.Length)
{
    int n = audioSpeechStream.Read(header, bytesRead, header.Length - bytesRead);

Member

Are we sure the stream is positioned at the beginning to ensure we are reading the header?
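
To make the concern concrete: a peek like the one in this hunk reads from wherever the stream currently sits, so it only sees the real header if the caller has not already advanced the stream. The sketch below is a hypothetical helper (`HeaderPeek.Peek` is not part of the PR) that completes the loop, skips non-seekable streams, and restores the position it started from. Treating the current position as the start of the audio is an assumption.

```csharp
using System;
using System.IO;

public static class HeaderPeek
{
    // Hypothetical helper, not the PR's code: reads up to header.Length bytes
    // from the stream's current position, then restores that position.
    public static int Peek(Stream stream, byte[] header)
    {
        if (!stream.CanSeek)
        {
            return 0; // Cannot rewind, so do not consume anything.
        }

        long start = stream.Position;
        int bytesRead = 0;
        while (bytesRead < header.Length)
        {
            int n = stream.Read(header, bytesRead, header.Length - bytesRead);
            if (n == 0)
            {
                break; // End of stream before the buffer was filled.
            }

            bytesRead += n;
        }

        stream.Position = start; // Leave the stream exactly where the caller had it.
        return bytesRead;
    }
}
```

If a caller passes a stream positioned at offset 3, the returned bytes start at offset 3 rather than at the file header, which is the situation the question above is probing.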

int bytesRead = 0;
while (bytesRead < header.Length)
{
    int n = audioSpeechStream.Read(header, bytesRead, header.Length - bytesRead);

Member

Can you use Stream.ReadExactly here, or does that prevent you from being able to reliably rewind after reading in an unsuccessful read-exactly scenario?

Maybe Stream.ReadAtLeast might be appropriate though, with throwOnEndOfStream set to false?

But then again, maybe just having this loop here is the cleanest option, as you're not working against what those convenience APIs are trying to do.
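
For comparison, the `Stream.ReadAtLeast` variant suggested above might look like the following sketch. `StreamPeek.PeekHeader` is a hypothetical name, and the sketch assumes a target framework where `ReadAtLeast` is available (.NET 7 or later), which may not hold for every target the library builds for.

```csharp
using System;
using System.IO;

public static class StreamPeek
{
    // Hypothetical helper: fills as much of the buffer as the stream allows,
    // without throwing on a short stream, then rewinds to the original position.
    public static int PeekHeader(Stream stream, Span<byte> buffer)
    {
        if (!stream.CanSeek)
        {
            return 0; // Rewinding is impossible, so leave the stream untouched.
        }

        long start = stream.Position;

        // With throwOnEndOfStream: false, a stream shorter than the buffer
        // yields a smaller count instead of an EndOfStreamException.
        int bytesRead = stream.ReadAtLeast(buffer, buffer.Length, throwOnEndOfStream: false);

        stream.Position = start;
        return bytesRead;
    }
}
```

Because the position is captured before the read and restored afterwards, a short read still rewinds correctly, which speaks to the rewind concern raised about `ReadExactly`.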

}

audioSpeechStream.Position -= bytesRead;
return $"audio.{DetectAudioExtension(header.AsSpan(0, bytesRead))}";

Member

What happens if we get an unrecognized format?

Labels

area-ai Microsoft.Extensions.AI libraries

Development

Successfully merging this pull request may close these issues.

ISpeechToTextClient does not allow to specify the audio FileName

5 participants